
Conversation

@JonPsson1
Contributor

Since the heavy spilling in Cactus was found to be caused by the
MachineScheduler, it has been clear that there is room for improvement in
GenericScheduler. A first attempt was made with #90181, but it did not
materialize into anything actually useful.

This time, a SystemZ specific MachineSchedStrategy has been developed, guided
by experimentation and benchmarking.

Finding that GenericScheduler fails to handle liveness despite all the work
it does with PressureSets etc., I tried a simpler approach: keep track of
live registers and schedule e.g. a load immediately if doing so closes a
live range. If this is always done first, it then seems to work well to also
consider latencies as a second step. The final piece that makes it all work
well is the special treatment of tiny regions, which remedies observed
problems with comparison elimination, phys-reg copies and even compile time.
These regions (with up to 10 instructions) are mostly left as they were.

SPEC Results (omitting <1% changes):

f538.imagick_r      -13.34%
f507.cactuBSSN_r    -11.60% (*)
f526.blender_r       -3.82%
f544.nab_r           -1.99%
f508.namd_r          -1.82%
f519.lbm_r           -1.67%

(*) Improved 5% by disabling pre-RA scheduling.

geometric mean across all benchmarks: -1.81%

Number of Spill/Reload instructions : -5718 (-1%)
Number of register moves:             +1778 (+0.2%)

Compile time at first showed some heavy regressions, caused by a large
number of LIS->getInterval() lookups in functions with many small blocks
(regions) and vregs. With the tiny-regions handling, however, this problem
went away, and compile time now seems to be on par with GenericSched.
If needed, there are likely things that could be sped up further.

@llvmbot
Member

llvmbot commented Apr 9, 2025

@llvm/pr-subscribers-llvm-regalloc

@llvm/pr-subscribers-backend-systemz

Author: Jonas Paulsson (JonPsson1)



Patch is 211.25 KiB, truncated to 20.00 KiB below, full version: https://github.com/llvm/llvm-project/pull/135076.diff

58 Files Affected:

  • (modified) llvm/include/llvm/CodeGen/MachineScheduler.h (+2)
  • (modified) llvm/lib/CodeGen/MachineScheduler.cpp (+26-25)
  • (modified) llvm/lib/Target/SystemZ/SystemZElimCompare.cpp (+5-37)
  • (modified) llvm/lib/Target/SystemZ/SystemZInstrInfo.cpp (+28)
  • (modified) llvm/lib/Target/SystemZ/SystemZInstrInfo.h (+11)
  • (modified) llvm/lib/Target/SystemZ/SystemZMachineScheduler.cpp (+407-8)
  • (modified) llvm/lib/Target/SystemZ/SystemZMachineScheduler.h (+74-1)
  • (modified) llvm/lib/Target/SystemZ/SystemZTargetMachine.cpp (+19)
  • (modified) llvm/lib/Target/SystemZ/SystemZTargetMachine.h (+4)
  • (modified) llvm/test/CodeGen/SystemZ/DAGCombiner_isAlias.ll (+4-4)
  • (modified) llvm/test/CodeGen/SystemZ/alias-01.ll (+2-2)
  • (modified) llvm/test/CodeGen/SystemZ/args-06.ll (+4-4)
  • (modified) llvm/test/CodeGen/SystemZ/args-12.ll (+1-1)
  • (modified) llvm/test/CodeGen/SystemZ/atomic-load-09.ll (+2-2)
  • (modified) llvm/test/CodeGen/SystemZ/atomic-store-08.ll (+4-4)
  • (modified) llvm/test/CodeGen/SystemZ/atomic-store-09.ll (+2-2)
  • (modified) llvm/test/CodeGen/SystemZ/atomicrmw-fmax-01.ll (+4-4)
  • (modified) llvm/test/CodeGen/SystemZ/atomicrmw-fmin-01.ll (+4-4)
  • (modified) llvm/test/CodeGen/SystemZ/atomicrmw-ops-i128.ll (+3-3)
  • (modified) llvm/test/CodeGen/SystemZ/bswap-09.ll (+5-5)
  • (modified) llvm/test/CodeGen/SystemZ/bswap-10.ll (+5-5)
  • (modified) llvm/test/CodeGen/SystemZ/call-zos-vec.ll (+9-9)
  • (modified) llvm/test/CodeGen/SystemZ/dag-combine-05.ll (+10-10)
  • (modified) llvm/test/CodeGen/SystemZ/inline-asm-fp-int-casting-explicit-regs-zEC12.ll (+4-4)
  • (modified) llvm/test/CodeGen/SystemZ/inline-asm-fp-int-casting-zEC12.ll (+4-4)
  • (modified) llvm/test/CodeGen/SystemZ/int-conv-14.ll (+6-6)
  • (modified) llvm/test/CodeGen/SystemZ/int-mul-12.ll (+13-14)
  • (modified) llvm/test/CodeGen/SystemZ/int-mul-13.ll (+4-4)
  • (modified) llvm/test/CodeGen/SystemZ/int-mul-15.ll (+5-5)
  • (modified) llvm/test/CodeGen/SystemZ/int-uadd-14.ll (+19-19)
  • (modified) llvm/test/CodeGen/SystemZ/int-usub-13.ll (+25-25)
  • (modified) llvm/test/CodeGen/SystemZ/machine-combiner-reassoc-fp.ll (+96-96)
  • (modified) llvm/test/CodeGen/SystemZ/memcpy-01.ll (+2-2)
  • (modified) llvm/test/CodeGen/SystemZ/memset-01.ll (+2-2)
  • (added) llvm/test/CodeGen/SystemZ/misched-prera-biaspregs.mir (+87)
  • (added) llvm/test/CodeGen/SystemZ/misched-prera-copy-coal.mir (+31)
  • (added) llvm/test/CodeGen/SystemZ/misched-prera-latencies.mir (+167)
  • (added) llvm/test/CodeGen/SystemZ/misched-prera-loads.mir (+391)
  • (added) llvm/test/CodeGen/SystemZ/misched-prera-manystores-01.ll (+31)
  • (added) llvm/test/CodeGen/SystemZ/misched-prera-manystores-02.mir (+200)
  • (added) llvm/test/CodeGen/SystemZ/misched-prera-manystores-03.mir (+154)
  • (added) llvm/test/CodeGen/SystemZ/misched-prera-tinyregions.mir (+160)
  • (modified) llvm/test/CodeGen/SystemZ/regcoal_remat_empty_subrange.ll (+3-3)
  • (modified) llvm/test/CodeGen/SystemZ/rot-03.ll (+11-11)
  • (modified) llvm/test/CodeGen/SystemZ/shift-13.ll (+24-24)
  • (modified) llvm/test/CodeGen/SystemZ/shift-14.ll (+24-24)
  • (modified) llvm/test/CodeGen/SystemZ/shift-15.ll (+24-24)
  • (modified) llvm/test/CodeGen/SystemZ/shift-16.ll (+43-43)
  • (modified) llvm/test/CodeGen/SystemZ/shift-17.ll (+50-50)
  • (modified) llvm/test/CodeGen/SystemZ/store_nonbytesized_vecs.ll (+40-40)
  • (modified) llvm/test/CodeGen/SystemZ/vec-args-04.ll (+8-8)
  • (modified) llvm/test/CodeGen/SystemZ/vec-cmp-cmp-logic-select.ll (+151-151)
  • (renamed) llvm/test/CodeGen/SystemZ/vec-cmpsel-01.ll (+94-105)
  • (added) llvm/test/CodeGen/SystemZ/vec-cmpsel-02.ll (+70)
  • (modified) llvm/test/CodeGen/SystemZ/vec-eval.ll (+8-8)
  • (modified) llvm/test/CodeGen/SystemZ/vec-move-23.ll (+165-85)
  • (modified) llvm/test/CodeGen/SystemZ/vec-sub-01.ll (+10-10)
  • (modified) llvm/test/CodeGen/SystemZ/vector-constrained-fp-intrinsics.ll (+99-99)
diff --git a/llvm/include/llvm/CodeGen/MachineScheduler.h b/llvm/include/llvm/CodeGen/MachineScheduler.h
index bc00d0b4ff852..8a819051f069a 100644
--- a/llvm/include/llvm/CodeGen/MachineScheduler.h
+++ b/llvm/include/llvm/CodeGen/MachineScheduler.h
@@ -1087,6 +1087,7 @@ class GenericSchedulerBase : public MachineSchedStrategy {
     NoCand,
     Only1,
     PhysReg,
+    LivenessReduce,
     RegExcess,
     RegCritical,
     Stall,
@@ -1218,6 +1219,7 @@ class GenericSchedulerBase : public MachineSchedStrategy {
 };
 
 // Utility functions used by heuristics in tryCandidate().
+unsigned computeRemLatency(SchedBoundary &CurrZone);
 bool tryLess(int TryVal, int CandVal,
              GenericSchedulerBase::SchedCandidate &TryCand,
              GenericSchedulerBase::SchedCandidate &Cand,
diff --git a/llvm/lib/CodeGen/MachineScheduler.cpp b/llvm/lib/CodeGen/MachineScheduler.cpp
index 97f27277aface..579f787b0af9b 100644
--- a/llvm/lib/CodeGen/MachineScheduler.cpp
+++ b/llvm/lib/CodeGen/MachineScheduler.cpp
@@ -3168,31 +3168,6 @@ initResourceDelta(const ScheduleDAGMI *DAG,
   }
 }
 
-/// Compute remaining latency. We need this both to determine whether the
-/// overall schedule has become latency-limited and whether the instructions
-/// outside this zone are resource or latency limited.
-///
-/// The "dependent" latency is updated incrementally during scheduling as the
-/// max height/depth of scheduled nodes minus the cycles since it was
-/// scheduled:
-///   DLat = max (N.depth - (CurrCycle - N.ReadyCycle) for N in Zone
-///
-/// The "independent" latency is the max ready queue depth:
-///   ILat = max N.depth for N in Available|Pending
-///
-/// RemainingLatency is the greater of independent and dependent latency.
-///
-/// These computations are expensive, especially in DAGs with many edges, so
-/// only do them if necessary.
-static unsigned computeRemLatency(SchedBoundary &CurrZone) {
-  unsigned RemLatency = CurrZone.getDependentLatency();
-  RemLatency = std::max(RemLatency,
-                        CurrZone.findMaxLatency(CurrZone.Available.elements()));
-  RemLatency = std::max(RemLatency,
-                        CurrZone.findMaxLatency(CurrZone.Pending.elements()));
-  return RemLatency;
-}
-
 /// Returns true if the current cycle plus remaning latency is greater than
 /// the critical path in the scheduling region.
 bool GenericSchedulerBase::shouldReduceLatency(const CandPolicy &Policy,
@@ -3278,6 +3253,7 @@ const char *GenericSchedulerBase::getReasonStr(
   case NoCand:         return "NOCAND    ";
   case Only1:          return "ONLY1     ";
   case PhysReg:        return "PHYS-REG  ";
+  case LivenessReduce: return "LIVE-REDUC";
   case RegExcess:      return "REG-EXCESS";
   case RegCritical:    return "REG-CRIT  ";
   case Stall:          return "STALL     ";
@@ -3351,6 +3327,31 @@ void GenericSchedulerBase::traceCandidate(const SchedCandidate &Cand) {
 #endif
 
 namespace llvm {
+/// Compute remaining latency. We need this both to determine whether the
+/// overall schedule has become latency-limited and whether the instructions
+/// outside this zone are resource or latency limited.
+///
+/// The "dependent" latency is updated incrementally during scheduling as the
+/// max height/depth of scheduled nodes minus the cycles since it was
+/// scheduled:
+///   DLat = max (N.depth - (CurrCycle - N.ReadyCycle) for N in Zone
+///
+/// The "independent" latency is the max ready queue depth:
+///   ILat = max N.depth for N in Available|Pending
+///
+/// RemainingLatency is the greater of independent and dependent latency.
+///
+/// These computations are expensive, especially in DAGs with many edges, so
+/// only do them if necessary.
+unsigned computeRemLatency(SchedBoundary &CurrZone) {
+  unsigned RemLatency = CurrZone.getDependentLatency();
+  RemLatency = std::max(RemLatency,
+                        CurrZone.findMaxLatency(CurrZone.Available.elements()));
+  RemLatency = std::max(RemLatency,
+                        CurrZone.findMaxLatency(CurrZone.Pending.elements()));
+  return RemLatency;
+}
+
 /// Return true if this heuristic determines order.
 /// TODO: Consider refactor return type of these functions as integer or enum,
 /// as we may need to differentiate whether TryCand is better than Cand.
diff --git a/llvm/lib/Target/SystemZ/SystemZElimCompare.cpp b/llvm/lib/Target/SystemZ/SystemZElimCompare.cpp
index 81f0014dd83f2..149a7b2f451d5 100644
--- a/llvm/lib/Target/SystemZ/SystemZElimCompare.cpp
+++ b/llvm/lib/Target/SystemZ/SystemZElimCompare.cpp
@@ -151,30 +151,6 @@ Reference SystemZElimCompare::getRegReferences(MachineInstr &MI, unsigned Reg) {
   return Ref;
 }
 
-// Return true if this is a load and test which can be optimized the
-// same way as compare instruction.
-static bool isLoadAndTestAsCmp(MachineInstr &MI) {
-  // If we during isel used a load-and-test as a compare with 0, the
-  // def operand is dead.
-  return (MI.getOpcode() == SystemZ::LTEBR ||
-          MI.getOpcode() == SystemZ::LTDBR ||
-          MI.getOpcode() == SystemZ::LTXBR) &&
-         MI.getOperand(0).isDead();
-}
-
-// Return the source register of Compare, which is the unknown value
-// being tested.
-static unsigned getCompareSourceReg(MachineInstr &Compare) {
-  unsigned reg = 0;
-  if (Compare.isCompare())
-    reg = Compare.getOperand(0).getReg();
-  else if (isLoadAndTestAsCmp(Compare))
-    reg = Compare.getOperand(1).getReg();
-  assert(reg);
-
-  return reg;
-}
-
 // Compare compares the result of MI against zero.  If MI is an addition
 // of -1 and if CCUsers is a single branch on nonzero, eliminate the addition
 // and convert the branch to a BRCT(G) or BRCTH.  Return true on success.
@@ -207,7 +183,7 @@ bool SystemZElimCompare::convertToBRCT(
   // We already know that there are no references to the register between
   // MI and Compare.  Make sure that there are also no references between
   // Compare and Branch.
-  unsigned SrcReg = getCompareSourceReg(Compare);
+  unsigned SrcReg = TII->getCompareSourceReg(Compare);
   MachineBasicBlock::iterator MBBI = Compare, MBBE = Branch;
   for (++MBBI; MBBI != MBBE; ++MBBI)
     if (getRegReferences(*MBBI, SrcReg))
@@ -254,7 +230,7 @@ bool SystemZElimCompare::convertToLoadAndTrap(
   // We already know that there are no references to the register between
   // MI and Compare.  Make sure that there are also no references between
   // Compare and Branch.
-  unsigned SrcReg = getCompareSourceReg(Compare);
+  unsigned SrcReg = TII->getCompareSourceReg(Compare);
   MachineBasicBlock::iterator MBBI = Compare, MBBE = Branch;
   for (++MBBI; MBBI != MBBE; ++MBBI)
     if (getRegReferences(*MBBI, SrcReg))
@@ -495,25 +471,17 @@ bool SystemZElimCompare::adjustCCMasksForInstr(
   return true;
 }
 
-// Return true if Compare is a comparison against zero.
-static bool isCompareZero(MachineInstr &Compare) {
-  if (isLoadAndTestAsCmp(Compare))
-    return true;
-  return Compare.getNumExplicitOperands() == 2 &&
-    Compare.getOperand(1).isImm() && Compare.getOperand(1).getImm() == 0;
-}
-
 // Try to optimize cases where comparison instruction Compare is testing
 // a value against zero.  Return true on success and if Compare should be
 // deleted as dead.  CCUsers is the list of instructions that use the CC
 // value produced by Compare.
 bool SystemZElimCompare::optimizeCompareZero(
     MachineInstr &Compare, SmallVectorImpl<MachineInstr *> &CCUsers) {
-  if (!isCompareZero(Compare))
+  if (!TII->isCompareZero(Compare))
     return false;
 
   // Search back for CC results that are based on the first operand.
-  unsigned SrcReg = getCompareSourceReg(Compare);
+  unsigned SrcReg = TII->getCompareSourceReg(Compare);
   MachineBasicBlock &MBB = *Compare.getParent();
   Reference CCRefs;
   Reference SrcRefs;
@@ -702,7 +670,7 @@ bool SystemZElimCompare::processBlock(MachineBasicBlock &MBB) {
   MachineBasicBlock::iterator MBBI = MBB.end();
   while (MBBI != MBB.begin()) {
     MachineInstr &MI = *--MBBI;
-    if (CompleteCCUsers && (MI.isCompare() || isLoadAndTestAsCmp(MI)) &&
+    if (CompleteCCUsers && (MI.isCompare() || TII->isLoadAndTestAsCmp(MI)) &&
         (optimizeCompareZero(MI, CCUsers) ||
          fuseCompareOperations(MI, CCUsers))) {
       ++MBBI;
diff --git a/llvm/lib/Target/SystemZ/SystemZInstrInfo.cpp b/llvm/lib/Target/SystemZ/SystemZInstrInfo.cpp
index 91a4aa9c73010..54a2adcc46198 100644
--- a/llvm/lib/Target/SystemZ/SystemZInstrInfo.cpp
+++ b/llvm/lib/Target/SystemZ/SystemZInstrInfo.cpp
@@ -2144,6 +2144,34 @@ unsigned SystemZInstrInfo::getFusedCompare(unsigned Opcode,
   return 0;
 }
 
+bool SystemZInstrInfo::isLoadAndTestAsCmp(const MachineInstr &MI) const {
+  // If we during isel used a load-and-test as a compare with 0, the
+  // def operand is dead.
+  return (MI.getOpcode() == SystemZ::LTEBR ||
+          MI.getOpcode() == SystemZ::LTDBR ||
+          MI.getOpcode() == SystemZ::LTXBR) &&
+    MI.getOperand(0).isDead();
+}
+
+bool SystemZInstrInfo::isCompareZero(const MachineInstr &Compare) const {
+  if (isLoadAndTestAsCmp(Compare))
+    return true;
+  return Compare.isCompare() && Compare.getNumExplicitOperands() == 2 &&
+    Compare.getOperand(1).isImm() && Compare.getOperand(1).getImm() == 0;
+}
+
+unsigned SystemZInstrInfo::
+getCompareSourceReg(const MachineInstr &Compare) const {
+  unsigned reg = 0;
+  if (Compare.isCompare())
+    reg = Compare.getOperand(0).getReg();
+  else if (isLoadAndTestAsCmp(Compare))
+    reg = Compare.getOperand(1).getReg();
+  assert(reg);
+
+  return reg;
+}
+
 bool SystemZInstrInfo::
 prepareCompareSwapOperands(MachineBasicBlock::iterator const MBBI) const {
   assert(MBBI->isCompare() && MBBI->getOperand(0).isReg() &&
diff --git a/llvm/lib/Target/SystemZ/SystemZInstrInfo.h b/llvm/lib/Target/SystemZ/SystemZInstrInfo.h
index 8b82af61e669a..2030d52becc0e 100644
--- a/llvm/lib/Target/SystemZ/SystemZInstrInfo.h
+++ b/llvm/lib/Target/SystemZ/SystemZInstrInfo.h
@@ -356,6 +356,17 @@ class SystemZInstrInfo : public SystemZGenInstrInfo {
                            SystemZII::FusedCompareType Type,
                            const MachineInstr *MI = nullptr) const;
 
+  // Return true if this is a load and test which can be optimized the
+  // same way as compare instruction.
+  bool isLoadAndTestAsCmp(const MachineInstr &MI) const;
+
+  // Return true if Compare is a comparison against zero.
+  bool isCompareZero(const MachineInstr &Compare) const;
+
+  // Return the source register of Compare, which is the unknown value
+  // being tested.
+  unsigned getCompareSourceReg(const MachineInstr &Compare) const;
+
   // Try to find all CC users of the compare instruction (MBBI) and update
   // all of them to maintain equivalent behavior after swapping the compare
   // operands. Return false if not all users can be conclusively found and
diff --git a/llvm/lib/Target/SystemZ/SystemZMachineScheduler.cpp b/llvm/lib/Target/SystemZ/SystemZMachineScheduler.cpp
index 5e2365f1dc513..85376ec70edc5 100644
--- a/llvm/lib/Target/SystemZ/SystemZMachineScheduler.cpp
+++ b/llvm/lib/Target/SystemZ/SystemZMachineScheduler.cpp
@@ -5,22 +5,421 @@
 // SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
 //
 //===----------------------------------------------------------------------===//
-//
-// -------------------------- Post RA scheduling ---------------------------- //
-// SystemZPostRASchedStrategy is a scheduling strategy which is plugged into
-// the MachineScheduler. It has a sorted Available set of SUs and a pickNode()
-// implementation that looks to optimize decoder grouping and balance the
-// usage of processor resources. Scheduler states are saved for the end
-// region of each MBB, so that a successor block can learn from it.
-//===----------------------------------------------------------------------===//
 
 #include "SystemZMachineScheduler.h"
+#include "llvm/CodeGen/LiveInterval.h"
+#include "llvm/CodeGen/LiveIntervals.h"
 #include "llvm/CodeGen/MachineLoopInfo.h"
 
 using namespace llvm;
 
 #define DEBUG_TYPE "machine-scheduler"
 
+/// Pre-RA scheduling ///
+
+static cl::opt<unsigned> TinyRegionLim(
+    "tiny-region-lim", cl::Hidden, cl::init(10),
+    cl::desc("Run limited pre-ra scheduling on regions of this size or "
+             "smaller. Mainly for testing."));
+
+static bool isRegDef(const MachineOperand &MO) {
+  return MO.isReg() && MO.isDef();
+}
+
+static bool isVirtRegDef(const MachineOperand &MO) {
+  return isRegDef(MO) && MO.getReg().isVirtual();
+}
+
+static bool isPhysRegDef(const MachineOperand &MO) {
+  return isRegDef(MO) && MO.getReg().isPhysical();
+}
+
+static bool isVirtRegUse(const MachineOperand &MO) {
+  return MO.isReg() && MO.isUse() && MO.readsReg() && MO.getReg().isVirtual();
+}
+
+void SystemZPreRASchedStrategy::initializePrioRegClasses(
+    const TargetRegisterInfo *TRI) {
+  for (const TargetRegisterClass *RC : TRI->regclasses()) {
+    for (MVT VT : MVT::fp_valuetypes())
+      if (TRI->isTypeLegalForClass(*RC, VT)) {
+        PrioRegClasses.insert(RC->getID());
+        break;
+      }
+
+    // On SystemZ vector and FP registers overlap: add any vector RC.
+    if (!PrioRegClasses.count(RC->getID()))
+      for (MVT VT : MVT::fp_fixedlen_vector_valuetypes())
+        if (TRI->isTypeLegalForClass(*RC, VT)) {
+          PrioRegClasses.insert(RC->getID());
+          break;
+        }
+  }
+}
+
+void SystemZPreRASchedStrategy::VRegSet::dump(std::string Msg) {
+  dbgs() << Msg.c_str();
+  bool First = true;
+  for (auto R : *this) {
+    if (!First)
+      dbgs() << ", ";
+    else
+      First = false;
+    dbgs() << "%" << R.virtRegIndex();
+  }
+  dbgs() << "\n";
+}
+
+unsigned SystemZPreRASchedStrategy::getRemLat(SchedBoundary *Zone) const {
+  if (RemLat == ~0U)
+    RemLat = computeRemLatency(*Zone);
+  return RemLat;
+}
+
+void SystemZPreRASchedStrategy::initializeStoresGroup() {
+  StoresGroup.clear();
+  FirstStoreInGroupScheduled = false;
+
+  unsigned CurrMaxDepth = 0;
+  for (unsigned Idx = DAG->SUnits.size() - 1; Idx + 1 != 0; --Idx) {
+    const SUnit *SU = &DAG->SUnits[Idx];
+    const MachineInstr *MI = SU->getInstr();
+    if (!MI->getNumOperands() || MI->isCopy())
+      continue;
+
+    bool HasVirtDef = false;
+    bool HasVirtUse = false;
+    for (unsigned I = 0; I < MI->getDesc().getNumOperands(); ++I) {
+      const MachineOperand &MO = MI->getOperand(I);
+      if (isVirtRegDef(MO) && !MO.isDead())
+        HasVirtDef = true;
+      else if (isVirtRegUse(MO) &&
+               MI->getDesc().operands()[I].OperandType != MCOI::OPERAND_MEMORY)
+        HasVirtUse = true;
+    }
+    bool IsStore = !HasVirtDef && HasVirtUse;
+
+    // Find a group of stores that all are at the bottom while avoiding
+    // regions with any additional group of lesser depth.
+    if (SU->getDepth() > CurrMaxDepth) {
+      CurrMaxDepth = SU->getDepth();
+      bool PrevGroup = StoresGroup.size() > 1;
+      StoresGroup.clear();
+      if (PrevGroup)
+        return;
+      if (IsStore)
+        StoresGroup.insert(SU);
+    }
+    else if (IsStore && !StoresGroup.empty() && SU->getDepth() == CurrMaxDepth) {
+      // The group members should all have the same opcode.
+      if ((*StoresGroup.begin())->getInstr()->getOpcode() != MI->getOpcode()) {
+        StoresGroup.clear();
+        return;
+      }
+      StoresGroup.insert(SU);
+    }
+  }
+
+  // Value of 8 handles a known regression (with group of 20).
+  // TODO: Would some other value be better?
+  if (StoresGroup.size() < 8)
+    StoresGroup.clear();
+}
+
+static int biasPhysRegExtra(const SUnit *SU) {
+  if (int Res = biasPhysReg(SU, /*isTop=*/false))
+    return Res;
+
+  // Also recognize Load Address of stack slot. There are (at least
+  // currently) no instructions here defining a physreg that uses a vreg.
+  const MachineInstr *MI = SU->getInstr();
+  if (MI->getNumOperands() && !MI->isCopy()) {
+    const MachineOperand &DefMO = MI->getOperand(0);
+    if (isPhysRegDef(DefMO))
+      return 1;
+  }
+
+  return 0;
+}
+
+int SystemZPreRASchedStrategy::
+computeSULivenessScore(SchedCandidate &C, ScheduleDAGMILive *DAG,
+                       SchedBoundary *Zone) const {
+  // Not all data deps are modelled around the SUnit - some data edges near
+  // boundaries are missing: Look directly at the MI operands instead.
+  const SUnit *SU = C.SU;
+  const MachineInstr *MI = SU->getInstr();
+  if (!MI->getNumOperands() || MI->isCopy())
+    return 0;
+
+  // Find uses of registers that are not already live (kills).
+  bool PrioKill = false;
+  bool GPRKill = false;
+  bool AddrKill = false;
+  bool HasPrioUse = false;
+  for (unsigned I = 0; I < MI->getDesc().getNumOperands(); ++I) {
+    const MachineOperand &MO = MI->getOperand(I);
+    if (!isVirtRegUse(MO))
+      continue;
+    HasPrioUse |= isPrioVirtReg(MO.getReg(), &DAG->MRI);
+    if (LiveRegs.count(MO.getReg()))
+      continue;
+    if (isPrioVirtReg(MO.getReg(), &DAG->MRI))
+      PrioKill = true;
+    else if (MI->getDesc().operands()[I].OperandType != MCOI::OPERAND_MEMORY)
+      GPRKill = true;
+    else
+      AddrKill = true;
+  }
+
+  // Find the interesting properties.
+  const MachineOperand &DefMO = MI->getOperand(0);
+  assert(!isPhysRegDef(DefMO) && "Did not expect physreg def!");
+  bool IsLoad =
+      isRegDef(DefMO) && !DefMO.isDead() && !IsRedefining[SU->NodeNum];
+  bool IsStore = (!isRegDef(DefMO) || DefMO.isDead());
+  // Prioritize FP: Ignore GPR/Addr kills with an FP def.
+  bool UsesLivePrio =
+      IsLoad && !PrioKill &&
+      (isPrioVirtReg(DefMO.getReg(), &DAG->MRI) || (!GPRKill && !AddrKill));
+  bool UsesLiveAll = !PrioKill && !GPRKill && !AddrKill;
+  bool PreservesSchedLat = SU->getHeight() <= Zone->getScheduledLatency();
+  const unsigned Cycles = 2;
+  unsigned Margin = SchedModel->getIssueWidth() * (Cycles + SU->Latency - 1);
+  bool HasDistToTop = NumLeft > Margin;
+
+  // Pull down a defining SU if it preserves the scheduled latency while not
+  // causing any (prioritized) register uses to become live. If however there
+  // will be relatively many SUs scheduled above this one and all uses are
+  // already live it should not be a problem to increase the scheduled
+  // latency given the OOO execution.
+  // TODO: Try schedulling small (DFSResult) subtrees as a unit.
+  bool SchedLow = IsLoad && ((PreservesSchedLat && UsesLivePrio) ||
+                             (HasDistToTop && UsesLiveAll));
+
+  // This handles regions with many chained stores of the same depth at the
+  // bottom in the input order (cactus). Push them upwards during scheduling.
+  bool SchedHigh = IsStore && FirstStoreInGroupScheduled &&
+                   StoresGroup.count(SU) &&
+                   (PrioKill || (!HasPrioUse && GPRKill));
+
+  if (SchedLow)
+    return -1;
+  if (SchedHigh)
+    return 1;
+  return 0;
+}
+
+bool SystemZPreRASchedStrategy::tryCandidate(SchedCandidate &Cand,
+                                             SchedCandidate &TryCand,
+                                             SchedBoundary *Zone) const {
+  assert(Zone && !Zone->isTop() && "Bottom-Up scheduling only.");
+
+  // Initialize the candidate if needed.
+  if (!Cand.isValid()) {
+    TryCand.Reason = FirstValid;
+    return true;
+  }
+
+  // Bias physreg defs and copies to their uses and definitions respectively.
+  int TryCandPRegBias = biasPhysRegExtra(TryCand.SU);
+  int CandPRegBias = biasPhysRegExtra(Cand.SU);
+  if (tryGreater(TryCandPRegBias, CandPRegBias, TryCand, Cand, PhysReg))
+    return TryCand.Reason != NoCand;
+  if (TryCandPRegBias && CandPRegBias) {
+    // Both biased same way.
+    tryGreater(TryCand.SU->NodeNum, Cand.SU->NodeNum, TryCand, Cand, NodeOrder);
+    return TryCand.Reason != NoCand;
+  }
+
+  if (TinyRegion) {
+    // Prioritize instructions that read unbuffered resources by stall cycles.
+    // TODO: Try this in bigger regions as well.
+    if (tryLess(Zone->getLatencyStallCycles(TryCand.SU),
+                Zone->getLatencyStallCycles(Cand.SU), TryCand, Cand, Stall))
+      return TryCand.Reason != NoCand;
+  } else {
+    // Look for an opportunity to reduce register liveness.
+    int TryCandScore = computeSULivenessScore(TryCand, DAG, Zone);
+    int CandScore = computeSULivenessScore(Cand, DAG, Zone);
+    if (tryLess(TryCandScore, CandScore, TryCand, Cand, LivenessReduce))
+      return TryCand.Reason != NoCand;
+
+    // Don't extend the scheduled latency.
+    if (ShouldReduceLatency && TryC...
[truncated]

@github-actions

github-actions bot commented Apr 9, 2025

⚠️ C/C++ code formatter, clang-format found issues in your code. ⚠️

You can test this locally with the following command:
git-clang-format --diff origin/main HEAD --extensions cpp,h -- llvm/include/llvm/CodeGen/MachineScheduler.h llvm/include/llvm/CodeGen/RegisterPressure.h llvm/lib/CodeGen/MachineScheduler.cpp llvm/lib/Target/SystemZ/SystemZElimCompare.cpp llvm/lib/Target/SystemZ/SystemZInstrInfo.cpp llvm/lib/Target/SystemZ/SystemZInstrInfo.h llvm/lib/Target/SystemZ/SystemZMachineScheduler.cpp llvm/lib/Target/SystemZ/SystemZMachineScheduler.h llvm/lib/Target/SystemZ/SystemZTargetMachine.cpp llvm/lib/Target/SystemZ/SystemZTargetMachine.h --diff_from_common_commit

⚠️
The reproduction instructions above might return results for more than one PR
in a stack if you are using a stacked PR workflow. You can limit the results by
changing origin/main to the base branch/commit you want to compare against.
⚠️

View the diff from clang-format here.
diff --git a/llvm/lib/Target/SystemZ/SystemZMachineScheduler.cpp b/llvm/lib/Target/SystemZ/SystemZMachineScheduler.cpp
index fc47578dd..c8da1d711 100644
--- a/llvm/lib/Target/SystemZ/SystemZMachineScheduler.cpp
+++ b/llvm/lib/Target/SystemZ/SystemZMachineScheduler.cpp
@@ -331,7 +331,7 @@ bool SystemZPreRASchedStrategy::tryCandidate(SchedCandidate &Cand,
             : TryCand.SU->getHeight() < Cand.SU->getHeight() ? Cand.SU
                                                              : nullptr) {
       if (HigherSU->getHeight() > Zone->getScheduledLatency() &&
-          HigherSU->getDepth()  < getRemLat(Zone)) {
+          HigherSU->getDepth() < getRemLat(Zone)) {
         tryLess(TryCand.SU->getHeight(), Cand.SU->getHeight(), TryCand, Cand,
                 GenericSchedulerBase::BotHeightReduce);
         return TryCand.Reason != NoCand;

@JonPsson1
Contributor Author

@preames @wangpc-pp @michaelmaitland @arsenm @asb

Any comments welcome.

@JonPsson1
Contributor Author

Note:
The test vec-cmpsel.ll has been split into vec-cmpsel-01.ll and vec-cmpsel-02.ll.
vec-move-23.ll has been reformatted a bit by update_llc_test_checks.py.

Contributor

@wangpc-pp left a comment

Sorry, I am not an expert on SystemZ (I know nothing about it), so I can only give some high-level comments.

@@ -0,0 +1,200 @@
# RUN: llc -o - %s -mtriple=s390x-linux-gnu -mcpu=z196 -verify-machineinstrs \
Contributor

Pre-commit these new tests so that we can see the effect of the new scheduling strategy.

Member

@mshockwave left a comment

Do you have any insights on why and how the existing pressure set approach is insufficient?

@JonPsson1
Contributor Author

Do you have any insights on why and how the existing pressure set approach is insufficient?

That's a good question... Maybe it is doing too little, too late: it waits for certain limits to be reached, but by then it has already opened up many register intervals, and for some reason trying to reduce register pressure at that point isn't enough. It seems to work better to reduce liveness regardless of the current pressure, i.e. as soon as possible.

I am not sure if the PSets themselves are overly complex and get in the way of things. On SystemZ, the simple approach I am using considers that there are basically two sets of registers available: the FP regs (which overlap the vector regs) and the GPRs. The idea is to reduce FP pressure as much as possible by ignoring GPR regs in FP-intensive regions. I see 5 PSets on SystemZ; I am not sure that works the same as having just two, but ideally it should. Prioritizing FP regs seems to work well, and is maybe an important difference from GenericScheduler.
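
For reference, the two-set classification boils down to something like this (a sketch only; isPrioVirtReg() is referenced in the diff above but its body is not shown there, and PrioRegClasses is assumed to hold the FP/vector register-class IDs collected by initializePrioRegClasses()):

// A vreg counts as "prio" (FP/vector) if its register class was collected
// as legal for an FP or vector value type; everything else is treated as GPR.
bool SystemZPreRASchedStrategy::isPrioVirtReg(Register Reg,
                                              const MachineRegisterInfo *MRI) const {
  return Reg.isVirtual() &&
         PrioRegClasses.count(MRI->getRegClass(Reg)->getID());
}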

Another difference is that it seems to work better not to consider memory operands (GPR registers used to hold addresses), as they are not naturally part of the data flow. This is something PSets don't know about - how the register is actually used. But I don't think that is a major point in all this.

I have also seen the handling of critical PSets go wrong in a large region that is very FP-intensive. In the input order, the integer sets have the most pressure, but a lot of potential FP overlap results after scheduling (with increased heavy spilling). I tried fixing this by providing the correct critical PSets, but it did not help much on that test case. Another option (suggested by @atrick, I think) is to have a second pass which could, among other things, correct this when it goes wrong. This I haven't tried, however.

@JonPsson1
Contributor Author

A patch with scripts that I have been working with as a tool for developing this strategy can be found here: #136483.

@atrick Any comments from you (and others ofc) would be very welcome before committing this. Are there any "no-nos" in this strategy, for instance related to compile time? This strategy is now at a good point, but that doesn't mean things can't get better. Do you have any ideas at this point that might be worth trying?

@lhames

@JonPsson1
Contributor Author

@uweigand regarding compile time (per pass as reported by -ftime-report):

  • zig.bc: slightly worse but still relatively fast: 2.2% vs 1.7%.

  • llvm test-suite build:
    SystemZPreRASchedStrategy: Average: 0.98%, worst: 26.5%.
    GenericScheduler: Average: 0.88%, worst: 25%.

So far, no compile-time explosions have been seen. One speedup that might be worth trying (in initialize()) would be to find the live-in regs by testing only the regs touched in the region (as opposed to testing all of them). But since there are no known cases of this being significantly slow, maybe that can wait?
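
For reference, a rough sketch of that speedup idea (not part of the patch; the helper name is hypothetical): scan only the region's instructions and test those vregs against LiveIntervals at the region start, instead of testing every vreg in the function.

// Sketch: collect the region's live-in vregs from the registers it touches.
static void findRegionLiveIns(MachineBasicBlock::iterator RegionBegin,
                              MachineBasicBlock::iterator RegionEnd,
                              const LiveIntervals &LIS,
                              DenseSet<Register> &LiveRegs) {
  SlotIndex RegionStart = LIS.getInstructionIndex(*RegionBegin);
  for (auto I = RegionBegin; I != RegionEnd; ++I)
    for (const MachineOperand &MO : I->operands())
      if (MO.isReg() && MO.getReg().isVirtual() &&
          LIS.getInterval(MO.getReg()).liveAt(RegionStart))
        LiveRegs.insert(MO.getReg());
}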

@JonPsson1
Contributor Author

@arsenm Thanks for the review - patch updated accordingly.

@JonPsson1
Contributor Author

ping - any thoughts / objections before committing this?

@JonPsson1
Contributor Author

Rebase.

@JonPsson1
Contributor Author

@uweigand

I have made some progress on reusing existing common code instead of the LiveRegs set I added:

  • add a public isRegLive(Reg) method to RegPressureTracker to be able to check a vreg's liveness instead of using the previously added VRegSet for this purpose (NFC).

  • It seems possible to use the PDiff of an SU to infer the same answers to liveness questions as by iterating over the operands. Using PDiffs is at least NFC on SPEC, so it seems to work. This commit has both alternatives, with a CL option to choose between them, to make comparison easy. The next step is to decide which method is preferred.

Pros/Cons (PDiff vs operands liveness):

operands/registers liveness:

  • More intuitive when working with the scheduler and test cases, as a reg/operand corresponds directly to liveness. Perhaps needed for future improvements?

PDiffs:

  • Already precomputed by common-code.
  • Need to find the right pressure-sets (initializePressureSets), which may change with TableGen.
  • A bit awkward to go back from the generalization of PressureSets to operand liveness.

@lhames
Contributor

lhames commented Aug 18, 2025

@lhames

Sorry -- I'm way out of the loop on register allocation and can't be any help here.

@uweigand
Member

Hi @JonPsson1, I've started looking into this and have a few high-level questions:

  • As to latency - common code uses logic in shouldReduceLatency based on critical path length to decide whether or not latency is important right now. What is the difference between that common code method and the custom check you're implementing here?
  • As to register pressure - you're now using the pressure difference per SU computed by common code, but not the actual pressure itself at this point (which is also already computed by common code). Wouldn't that be important to consider? It doesn't really matter if an instruction would use one more vector register if we still have 20 unused at this point, does it?

@JonPsson1
Contributor Author

As to latency - common code uses logic in shouldReduceLatency based on critical path length to decide whether or not latency is important right now. What is the difference between that common code method and the custom check you're implementing here?

shouldReduceLatency() can give a decision before each picked SU, based on the remaining critical path at the given moment, i.e. on the current cycle and the remaining latency. The HasDataSequences check I implemented instead looks at a region before scheduling and then triggers latency reduction across the entire region, before each picked node.
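
For reference, the common-code check boils down to this (paraphrasing shouldReduceLatency() and its doc comment, which is also visible in the diff above): latency becomes the priority once the cycles consumed so far plus the remaining latency exceed the region's critical path.

bool ShouldReduceLat =
    CurrZone.getCurrCycle() + computeRemLatency(CurrZone) > Rem.CriticalPath;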

I have generally tried to move away from the cycle-based scheduling heuristics used by GenericScheduler, as during pre-RA scheduling they mostly only make exact sense for an in-order target, I'd say. I could, however, not really make the argument for one or the other except by trying it out, which I did:

First, I examined how many times in SystemZPreRASchedStrategy::tryCandidate() at the point of interest (below liveness reduction) these would yield "true":

Total                 : 4.8M queries
shouldReduceLatency() : 3M   
HasDataSequence       : 200K 
HasDataSequence && shouldReduceLatency: 42K  

In other words, "always true" would fire 4.8M times (note that this number includes all the iterations across the Available queue), barring the initial check that excludes some very "wide" DAGs. This is what was done before the recent improvement. shouldReduceLatency() cuts that almost in half, while HasDataSequence is relatively rare.

When I noticed a regression on lbm after returning to this project, I tried disabling this latency heuristic altogether and saw these results:

patch <> patch without latency heuristic
Improvements:
0.965: f519.lbm_r 
0.976: i505.mcf_r 
0.981: f544.nab_r 
Regressions:
1.052: f508.namd_r 
1.051: f507.cactuBSSN_r 
1.026: f526.blender_r 

The lbm problem went away, but other benchmarks suffered.

After working with this and adding the (HasDataSequences || Rem.IsAcyclicLatencyLimited) guard (per the patch posted here currently), I see these numbers:

main <> patch
Improvements:
0.885: f507.cactuBSSN_r 
0.963: f544.nab_r 
0.974: f526.blender_r 
0.987: f508.namd_r 
0.990: i531.deepsjeng_r 
0.998: i505.mcf_r 
Regressions:
1.007: i523.xalancbmk_r 
1.004: f519.lbm_r

Now, back to the experiment with shouldReduceLatency(). I tried two changes: using shouldReduceLatency() instead of HasDataSequences, or both combined like (HasDataSequences && shouldReduceLatency()).

There is a slight (+2000 spills/reloads) increase in spilling with shouldReduceLatency(), which is sort of as expected.

Performance-wise, it seems to handle e.g. cactus equally well, but there are slight losses on some benchmarks:

patch <> patch using shouldReduceLatency()
Regressions:
1.019: f526.blender_r 
1.010: i500.perlbench_r 
1.008: i505.mcf_r 
patch <> patch using (HasDataSequences && shouldReduceLatency())
Regressions:
1.028: f526.blender_r 
1.019: i500.perlbench_r 

Given the results, it seems better to schedule more aggressively for latency throughout the region than to do it per the cycle-based critical-path heuristic, provided it is done only on the regions where this matters. HasDataSequence gives me some confidence as it delivers better performance with a lot less involvement, and also does better than the combination of the two.

diff --git a/llvm/include/llvm/CodeGen/MachineScheduler.h b/llvm/include/llvm/CodeGen/MachineScheduler.h
index a0ef475..c21bb33 100644
--- a/llvm/include/llvm/CodeGen/MachineScheduler.h
+++ b/llvm/include/llvm/CodeGen/MachineScheduler.h
@@ -1224,7 +1224,7 @@ protected:
   void traceCandidate(const SchedCandidate &Cand);
 #endif
 
-private:
+protected:
   bool shouldReduceLatency(const CandPolicy &Policy, SchedBoundary &CurrZone,
                            bool ComputeRemLatency, unsigned &RemLatency) const;
 };
diff --git a/llvm/lib/Target/SystemZ/SystemZMachineScheduler.cpp b/llvm/lib/Target/SystemZ/SystemZMachineScheduler.cpp
index a3df7fc..dff2700 100644
--- a/llvm/lib/Target/SystemZ/SystemZMachineScheduler.cpp
+++ b/llvm/lib/Target/SystemZ/SystemZMachineScheduler.cpp
@@ -344,7 +344,14 @@ bool SystemZPreRASchedStrategy::tryCandidate(SchedCandidate &Cand,
     // Don't extend the scheduled latency in regions with many nodes in
     // simple data sequences, or for (single block loop) regions that are
     // acyclically (within a single loop iteration) latency limited.
-    if ((HasDataSequences || Rem.IsAcyclicLatencyLimited) &&
+
+    // EXPERIMENT: Use GenericSchedulerBase::shouldReduceLatency() instead of HasDataSequences
+    CandPolicy P;
+    getRemLat(Zone);
+    bool GenericShouldReduceLat = shouldReduceLatency(P, *Zone, false, RemLat);
+    if (((HasDataSequences && GenericShouldReduceLat) || Rem.IsAcyclicLatencyLimited) &&
+ // if ((GenericShouldReduceLat || Rem.IsAcyclicLatencyLimited) &&
+ // if ((HasDataSequences || Rem.IsAcyclicLatencyLimited) &&
         TryCand.SU->getHeight() != Cand.SU->getHeight() &&
         (std::max(TryCand.SU->getHeight(), Cand.SU->getHeight()) >
          Zone->getScheduledLatency())) {

@JonPsson1
Contributor Author

As to register pressure - you're now using the pressure difference per SU computed by common code, but not the actual pressure itself at this point (which is also already computed by common code). Wouldn't that be important to consider? It doesn't really matter if an instruction would use one more vector register if we still have 20 unused at this point, does it?

First of all, per the comment in RegisterPressure.cpp: "The register tracker is unaware of global liveness so ignores normal live-thru ranges...", there is (currently) no truly accurate awareness of the number of used / free registers.

The OOO reasoning I am using to motivate e.g. pulling a VL down right before its user is that this shouldn't matter if there are many (30) instructions above the VL, in a big region (HasDistToTop). And in the case where the scheduled latency is not increased by the VL (another SU at least as high was already scheduled below it) this is also less likely to matter (PreservesSchedLat).
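
For reference, the margin behind HasDistToTop is computed in the patch as shown below; the issue width and latency values here are only example assumptions, picked to show how a figure like 30 can arise:

Margin = IssueWidth * (Cycles + SU->Latency - 1)   // Cycles = 2 in the patch
       = 6 * (2 + 4 - 1) = 30                      // e.g. issue width 6, load latency 4
HasDistToTop = NumLeft > Margin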

I have now tried your idea in two versions: the first one checks for 2 free regs, the second one for the pressure to be below half of the limit (neither schedules the SU "low" if the register pressure is ok). If the current pressure were perfectly estimated, at least the second one should give no more spilling. I did get a few more spills/reloads with both versions:

patch <> "Pressure + 2 regs < Lim":  +3K spills/reloads
Improvements
0.988: i525.x264_r 
0.991: f538.imagick_r 
0.991: f519.lbm_r 
Regressions
1.035: f544.nab_r 
1.009: i500.perlbench_r 
1.006: f507.cactuBSSN_r 
patch <> "Pressure <= Lim / 2":  +1K spills/reloads
Improvements
0.990: f519.lbm_r 
Regressions:
1.036: f544.nab_r 
1.007: i505.mcf_r 
1.006: i523.xalancbmk_r 

It doesn't seem to give much one way or the other performance-wise... (I also tried another version using the "2 regs rule" only in cases where latency reduction is not enabled, with similar mixed results.) Both of these show f544.nab_r regressing. I tried this again on that benchmark, but only when latency reduction is not enabled, and then the regression disappeared. There were, however, only 5 more spills/reloads behind the 3.6% regression, so it's likely not a spilling problem; I am not sure exactly what the cause is.

I agree that this should in theory not hurt if done correctly, and maybe GenericScheduler would do better if it also set up the global liveness. At least it shouldn't have to cause a heavy increase in spilling, given that it checks the pressure 'Excess' very early. Another problem in GenericScheduler relating to this, however, is that the RegPressureTracker::getUpwardPressureDelta() that is used seems to track just one PressureSet that exceeds its limit. That means e.g. GRX32Bit could be tracked but VR16Bit ignored, IIUC. I think it would have to be able to check any Excess that the PressureDiff of an SU would affect. So with global liveness and better handling of pressure Excess it should do better; but then again, it will not matter in many interesting cases where there will be Excess anyway and it will always be a matter of reducing it as much as possible. @atrick - have you tried this?

@atrick has described the problem I saw in GenericScheduler as "backing itself into a corner". So the idea for me has been to not be so confident about the overall picture but rather act as the opportunity arises: first reduce reg-pressure as much as possible, and only after that reduce latency. For bigger regions (like in cactus) there will always be some spilling, so in those cases it always makes sense to close the live range, and doing it under the given checks seems to give good performance.

EXPERIMENT:
@@ -256,6 +258,9 @@ int SystemZPreRASchedStrategy::computeSULivenessScore(
     UsesLiveAll = !PrioKill && !GPRKill;
     StoreKill = (PrioKill || (!HasPrioUse && GPRKill));
   } else if (MO0.isReg() && MO0.getReg().isVirtual()) {
+    const RegPressureTracker &RPTracker = DAG->getBotRPTracker();
+    ArrayRef<unsigned> Pressure = RPTracker.getRegSetPressureAtPos();
+    bool InFPReg = false;
     int PrioPressureChange = 0;
     int GPRPressureChange = 0;
     const PressureDiff &PDiff = DAG->getPressureDiff(SU);
@@ -266,12 +271,40 @@ int SystemZPreRASchedStrategy::computeSULivenessScore(
         PrioPressureChange += PC.getUnitInc();
       else if (PC.getPSet() == GPRPressureSet)
         GPRPressureChange += PC.getUnitInc();
+      if (PC.getPSet() == SystemZ::FP16Bit)
+        InFPReg = true;
     }

     if (IsLoad) {
       bool PrioDefNoKill = PrioPressureChange == -RegWeight;
       bool GPRDefNoKill = GPRPressureChange == -RegWeight;
+
+      unsigned PSet = ~0;
+      unsigned RegFactor = 1;
+      if (PrioDefNoKill)
+        PSet = InFPReg ? SystemZ::FP16Bit : SystemZ::VR16Bit;
+      else if (GPRDefNoKill) {
+        PSet = SystemZ::GRX32Bit;
+        RegFactor = 2;
+      }
+      if (PSet != ~0) {
+        unsigned Lim = RPTracker.getRCI()->getRegPressureSetLimit(PSet);
+        if (!RPTracker.getLiveThru().empty())
+          Lim += RPTracker.getLiveThru()[PSet];
+
+        // EXPERIMENTAL:
+        // Guard against some other SU being scheduled instead that could
+        // cause 2 registers to become live. Skip this if it would still be
+        // below the limit.
+        // if (Pressure[PSet] + (2 * RegFactor) < Lim)
+        //   return 0;
+        // ~ or ~
+        // Alternatively, skip if pressure is low.
+        // if (Pressure[PSet] <= Lim / 2)
+        //   return 0;
+      }
+
       UsesLivePrio =

@JonPsson1
Contributor Author

Patch rebased.

  • Added an option for experimentation with latency reduction (-prera-lat-red). The default is the same as before, but there is also always/more/never/generic.

  • LBM has now fluctuated back to being slower on main, so the regression handling on that has now become an improvement against main. The older version, which had latency reduction enabled more often and showed the regression, is now equal with main. Imagick has likewise reverted to being slower on main, while remaining faster with the patch.

  • Updated tests so that they now pass with the latest changes.

@JonPsson1
Contributor Author

JonPsson1 commented Sep 9, 2025

Some minor fixes. It is all looking quite good to me now, so the next step would be to decide on using PressureDiffs (or not).

Given that PressureDiffs give identical results, they make sense to me, I guess. I have now simplified SystemZPreRASchedStrategy::initializePressureSets(), which should be OK given that e.g. GCNSchedStrategy::pickNodeFromQueue() hard-codes these sets by name. There is also a new SystemZ test for the PressureSets, in case they would change coming out of TableGen.

@JonPsson1
Contributor Author

These are the latest real benchmarking results, which show that this seems to have a good impact on SPEC FP.

f503.bwaves_r                    818.26s             817.67s                   -0.07%
f507.cactuBSSN_r                 323.14s             282.89s                  -12.46%
f508.namd_r                      208.29s             204.09s                   -2.02%
f510.parest_r                    505.40s             503.92s                   -0.29%
f511.povray_r                    568.38s             566.56s                   -0.32%
f519.lbm_r                       228.51s             222.96s                   -2.43%
f521.wrf_r                       411.52s             411.08s                   -0.11%
f526.blender_r                   267.38s             257.95s                   -3.53%
f527.cam4_r                      306.89s             308.40s                    0.49%
f538.imagick_r                   371.98s             324.00s                  -12.90%
f544.nab_r                       356.02s             344.27s                   -3.30%
f549.fotonik3d_r                 333.67s             333.22s                   -0.13%
f554.roms_r                      219.66s             219.54s                   -0.05%
i500.perlbench_r                 307.38s             308.42s                    0.34%
i502.gcc_r                       169.78s             169.24s                   -0.32%
i505.mcf_r                       283.45s             282.39s                   -0.37%
i520.omnetpp_r                   210.89s             211.89s                    0.47%
i523.xalancbmk_r                 180.09s             181.13s                    0.58%
i525.x264_r                      192.52s             193.03s                    0.26%
i531.deepsjeng_r                 257.37s             257.22s                   -0.06%
i541.leela_r                     478.44s             480.84s                    0.50%
i548.exchange2_r                 312.95s             311.87s                   -0.35%
i557.xz_r                        272.35s             269.74s                   -0.96%
                                                                               ------
--- total time                  7584.32s            7462.32s                 -122.00s
--- geometric mean                                                             -1.68%

@JonPsson1
Contributor Author

Benchmark results of latest version ("Simplifications"):

                                main                experiment3 
f503.bwaves_r                    850.58s             850.06s                   -0.06%
f507.cactuBSSN_r                 341.58s             298.53s                  -12.60%
f508.namd_r                      210.27s             205.96s                   -2.05%
f510.parest_r                    511.94s             511.24s                   -0.14%
f511.povray_r                    567.91s             564.75s                   -0.56%
f519.lbm_r                       229.21s             224.75s                   -1.95%
f521.wrf_r                       413.17s             412.63s                   -0.13%
f526.blender_r                   269.10s             261.91s                   -2.67%
f527.cam4_r                      313.08s             312.95s                   -0.04%
f538.imagick_r                   369.30s             321.46s                  -12.95%
f544.nab_r                       356.19s             346.42s                   -2.74%
f549.fotonik3d_r                 359.31s             361.24s                    0.54%
f554.roms_r                      222.44s             222.47s                    0.01%
i500.perlbench_r                 309.78s             307.17s                   -0.84%
i502.gcc_r                       170.44s             169.85s                   -0.35%
i505.mcf_r                       297.10s             295.36s                   -0.59%
i520.omnetpp_r                   210.72s             210.33s                   -0.19%
i523.xalancbmk_r                 182.40s             183.41s                    0.55%
i525.x264_r                      192.64s             192.67s                    0.02%
i531.deepsjeng_r                 262.40s             262.74s                    0.13%
i541.leela_r                     478.42s             479.58s                    0.24%
i548.exchange2_r                 314.44s             311.88s                   -0.81%
i557.xz_r                        272.67s             270.14s                   -0.93%
                                                                              ------
--- total time                  7705.09s            7577.50s                 -127.59s
--- geometric mean                                                             -1.73%

JonPsson and others added 5 commits October 30, 2025 18:07
Final cleanup.
MISched0, with -misched-gprloads.
Simplified and refined.
Remove previous versions.
-tiny-region
Disable pre-ra scheduling.
Revert "Disable pre-ra scheduling."
Try TinyRegion=10 without extra checks.
Revert "Try TinyRegion=10 without extra checks."
Try TinyRegion=10 with some exclusions.
Tiny10Lat without chainpreds and hazard
Was c0f4354
…y when

  either there are many nodes in data-sequences, or when
  IsAcyclicLatencyLimited is true.

- Remove the lengthy handling for tiny regions containing long latency
  instructions - it doesn't appear to be needed.

- Slight refactoring of computeSULivenessScore() with PDiffs (NFC).
Settle for always using PressureDiffs when checking liveness.
Squash.
- The handling of FPd stalls for tiny regions still gives 1% on nab, but that
  isn't enough to motivate it. Wait with this until it works well on all
  regions.

- Only apply one way of treating liveness of register uses in
  computeSULivenessScore(). What was previously UsesLiveAll is dropped and
  UsesLivePrio has become UsesLive.
@atrick
Contributor

atrick commented Nov 3, 2025

So with global liveness and better handling of pressure Excess it should do better, but then again it will not matter in many interesting cases where there will be Excess anyway where it will always be a matter of reducing it as much as possible. @atrick - have you tried this?

The generic scheduler was meant to exercise most of the basic MachineScheduler features without being heavily biased in one direction. It was sort of tuned to not perturb the schedule in most cases. At the time, people were hand-tuning C kernels and didn't want LLVM messing it up. I think you can always find a simpler, better strategy for a particular microarchitecture and the workloads that matter most to your platform. You can also do much more aggressive register pressure reduction with cross-block code motion (before MachineScheduling). And you can do multi-pass scheduling with different strategies each time, etc. As long as you can justify maintaining that extra logic and spending compilation time on it.

Help Cmp0 elim.
Fixes after rebase. Refactoring.
Tests updated.

Try without TinyRegion (Cmp-0, PReg COPYs)
IsSchedLowCand()
Experiments with ShouldReduceLiveness.
Was c13f7bc (e2dbbe7)
@JonPsson1
Contributor Author

JonPsson1 commented Nov 3, 2025

Rebased: The register coalescer has become more active, and there are now LAs
defining a PReg (out-arg), which there were not before. biasPhysRegExtra() has been
updated to also accept these (without the check), and it seems to be no problem.

The hard limit for tiny regions has been replaced by these points:

  • Don't pull down any of the 12 highest nodes for the purpose of closing the live range.
  • Demand at least 4 sequenced nodes for latency reduction (two chains or one long chain).
  • Help cmp0 elimination.

This worked well when enabling register pressure tracking for all regions. GenericScheduler
limits this around ~8 (half the number of allocatable GPRs), for the purpose of compile time
reduction. I have not spent that much time on compile time, but it is clear that this limit
is useful:

Compile time:         Average Wall   Worst   // Building SPEC with -j1 once per clang version.
GenericScheduler  :      1.82%       23.1%  16.8% 13.9% ...
>= 8 instructions :      1.97%       23.0%  20.0% 13.9% ...
all regions       :      2.88%       23.3%  20.7% 13.9% ...

Cmp0 elimination stats over SPEC:

systemz-elim-compare - Number of eliminated comparisons
main:                   191062
patch, without CmpElim: 190827
patch, with CmpElim:    192563

While this version achieves all of the performance improvements seen so far, there are
still possible follow-ups:

  • try setting up global liveness and fine-tuning liveness reduction with RPTracker.
  • Try using DFS subtrees for liveness reduction.
  • Revisit PRegs (bias, Copy2PRegBias)
  • shouldTrackPressure for all single block loops and acyclic-limited latency reduction?

Tests updated and passing.

@JonPsson1
Contributor Author

HighSUs has been improved for compile time:

  • don't grow a potentially huge set when only at most 12 SUs should result.
  • use std::sort on a vector instead of inserting all elements one by one.
  • Handle the case of a region with 12 instructions or fewer by leaving the set empty.

This brings the average pass time down to 1.905%.
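
A minimal sketch of the std::sort point above (shape only; the real HighSUs computation also applies the patch's selection criteria):

// Build a flat vector once and sort it, rather than inserting elements one
// by one into an ordered set; then keep only the ~12 highest nodes.
SmallVector<const SUnit *, 32> Sorted;
for (const SUnit &SU : DAG->SUnits)
  Sorted.push_back(&SU);
llvm::sort(Sorted, [](const SUnit *A, const SUnit *B) {
  return A->getHeight() > B->getHeight();
});
if (Sorted.size() > 12)
  Sorted.truncate(12);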

Cases with more than 10% compile time seemed to be mostly about buildDAGWithRegPressure.

In a smaller (and more common) region, perf indicated isSchedLowCand() as being slow. I tried caching the result of computeSULivenessScore(), but to my surprise that was slower. I am still experimenting with different versions, but I am not sure there is a reasonable fix; it seems to be on par with GenericSched, and maybe it isn't worth cluttering the code too much...

@JonPsson1
Contributor Author

Some experiments with liveness reduction (not committed to the branch), as to when to do / not do it when all uses are live:

  • Trying different values of TopCycles, which affects both the HighSUs set and the HasDistToTop margin: increasing from the default of 2 to 3 and 4 gave worse perf results, while 5 and 6 got better again, and then 7 worse again. 5 was the best of these, with a +0.74% average change across all benchmarks compared to 2. Conclusion: 2 seems best.

  • Replacing HasDistToTop with "SubsumedByRemLat", with the idea of being more precise and treating regions containing long-latency instructions differently, instead of just counting the number of remaining nodes as a crude margin to the top. Looking at an SU and its (closest) data successor as a unit, there is the remaining latency of (other) unscheduled nodes to consider.

           |   D  |
  Succ --> SU
  Succ ---------> SU
       ------------> RemLat
Placing SU closer to Succ means D more (decoding) cycles are added to SU.

computed with "The decoding cycles for scheduling SU next plus its latency is less than the rem latency of the successor":
NumLeft / IssueWidth + SU->Latency < Remaining latency of (closest) data successor.
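
As code, that check would look something like this (a sketch; SuccRemLat, the remaining latency of the closest data successor, is assumed to be computed separately):

// Schedule SU "low" only if the decode cycles it adds plus its own latency
// are still covered by the successor's remaining latency.
bool SubsumedByRemLat =
    NumLeft / SchedModel->getIssueWidth() + SU->Latency < SuccRemLat;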

Counting number of spill/reload (and copys) instructions in SPEC output:

                                 main   "With tiny regions limit"
Spill|Reload   :               532477               528520    -3957        // -0.75%
Copies         :               886962               886644     -318

                                 main   "HighSUs instead of TinyRegion"    (performance ref below)
Spill|Reload   :               532477               530071    -2406
Copies         :               886962               886732     -230

                                 main   "HighSUs, but with Pres/Subsumes"  (similar performance)
Spill|Reload   :               532477               528575    -3902
Copies         :               886962               887234     +272

                                 main   "HighSUs, but with Subsumes only"  (slightly worse perf)
Spill|Reload   :               532477               528468    -4009
Copies         :               886962               886944      -18

                                 main   "No liveness reduction"            (slightly worse perf)
Spill|Reload   :               532477               532461      -16
Copies         :               886962               886785     -177

Conclusion: the liveness reduction heuristic reduces spilling a bit, but performance is not directly in proportion to this alone, showing that it is important to consider other things such as latencies while helping liveness. SubsumedByRemLat is more involved but doesn't give any performance improvement.

  • Another idea was to skip the HasDistToTop and only rely on HighSUs for the top margin. I tried various values of TopCycles (2 - 11), and found that around 6 or 7 this seemed to work fairly well with similar perf results (within 0.1% on average). Conclusion: using TopCycles of 6 as default could work, eliminating the computation of HasDistToTop and also showing that this is in fact mostly useful if used only in regions with at least a few dozen instructions.
                                 main   "No HasDistToTop, TopCycles=6"     (similar performance)
Spill|Reload   :               532477               531009    -1468
Copies         :               886962               886822     -140

@JonPsson1
Contributor Author

Have now tried applying a general "TopMargin" of two decoder cycles, so that if there are fewer than that number of SUs left, the SU is not scheduled "low" for the sake of liveness. This time I tried that limit generally, meaning it is also demanded when the scheduled latency is preserved (PreservesSchedLat). It seems that this may be a slight (perf) improvement with "Preserves/Subsumes", but with the last measured "PreservesSchedLat || HasDistToTop" it is not helpful (slightly worse perf).

@JonPsson1
Contributor Author

Did another experiment removing the extra check that keeps two SUs that both have a PRegBias in their source order relative to each other. The check still seems possibly useful: removing it showed one regression of around 1%, otherwise no changes.

@JonPsson1
Contributor Author

Currently it does not seem easy to rewrite tryCandidate() and computeSULivenessScore() so as to avoid mixing the scheduled latency into the latter and then checking it again with shouldReduceLatency(). This is because computeSULivenessScore() also checks HasDistToTop if PreservesSchedLat is false.

However, I have done the experiment of removing HasDistToTop (see above), which seems to give similar performance if TopCycles is set to 6 (or maybe 7). TopCycles controls liveness reduction with HighSUs, so it achieves something similar to HasDistToTop. IIRC, values of 5-7 seemed to be <0.1% worse overall, while 4 and 8 were ~0.2% worse on average.

So without HasDistToTop it should be easier to rewrite tryCandidate() around PreservesSchedLat. Do you think this sounds like a good idea to try? @uweigand
